A new Viterbi decoder, capable of decoding convolutional codes with constraint 
lengths up to 15, is under development for the DSN. A key feature of this decoder is a 
two-level partitioning of the Viterbi state diagram into identical subgraphs. The larger 
subgraphs correspond to circuit boards, while the smaller subgraphs correspond to 
chips. The full decoder is built from identical boards, which in turn are built from identi- 
cal chips. The resulting system is modular and hierarchical. The decoder is easy to imple- 
ment, test, and repair because it uses a single VLSI chip design and a single board design. 
The partitioning is completely general in the sense that an appropriate number of boards 
or chips may be wired together to implement a Viterbi decoder of any size greater than 
or equal to the size of the module. 



I. Introduction 

A new Viterbi decoder [1], capable of decoding convolutional 
codes with constraint lengths up to 15, is under development 
for the DSN. This article describes a novel partitioning of 
the decoder’s state transition diagram that forms the basis for 
the new decoder’s architecture. 

The Viterbi algorithm is naturally fully parallel [2]. However, 
a fully parallel implementation of a large constraint 
length Viterbi decoder requires an impractical amount of 
hardware. The first question to be faced when building such a 
decoder is what part of this parallelism to throw away. We 
decided to retain a fully distributed architecture for comput- 
ing and exchanging accumulated metrics, but to perform the 
arithmetic computations bit-serially. The arithmetic computations 
are 16 bits long, so the decoding speed will be greater 
than 1 Mbit/sec with a 20-MHz system clock. 
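
As a rough check on the quoted rate, assume (the text gives only the word length and the clock rate) that each decoded bit costs one 16-cycle bit-serial add-compare-select pass and that any other per-bit overhead is negligible:

```latex
\[
\frac{20 \times 10^{6}\ \text{cycles/s}}{16\ \text{cycles/bit}}
  = 1.25\ \text{Mbit/s} > 1\ \text{Mbit/s}.
\]
```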


In a fully distributed architecture, there are $2^{K-1}$ basic 
computational elements called add-compare-select circuits [1] 
for a constraint length K decoder. When K is large, it is desirable 
to take a modular, hierarchical approach to organizing the 
huge number of required elements. Many add-compare-select 
circuits can be implemented on a single VLSI chip, and many 
chips can be mounted on a single printed circuit board. The 
full decoder is implemented by wiring together the required 
number of chips and boards. 

The main problem is wiring. How can the $2^{K-1}$ basic elements, 
each with two inputs and an output (going to two different 
elements’ inputs), be partitioned into chips and boards 
without using too many pins per chip or too large a board edge 
connector? This article shows first how pairs of add-compare-select 
circuits group to form elements called butterflies. The 
connection diagram of these $2^{K-2}$ butterflies is a deBruijn 
graph [3]; the butterflies are nodes in the graph and the edges 
of the graph represent wires between butterflies. The rest of 
the article shows how the set of butterflies can be split into 
modules called boards and the boards split into modules called 
chips, in such a way that a large proportion of the required 
connections between butterflies are implemented internally 
within the modules. The chips are all identical and the boards 
are all identical. Furthermore, their internal structure does not 
depend on the size of the decoder, and an appropriate number 
of board modules and chip modules can be wired together to 
make a decoder of any size equal to or greater than that of 
the smallest module. 

The constraint length 15 Viterbi decoder under development 
for the DSN is currently being designed with 16 boards 
and 512 chips. Each chip in this design contains 16 butterflies, 
and each board has 32 chips. However, the theory developed 
in this article is completely general and produces a modular, 
hierarchical partitioning of any size deBruijn graph into any 
number of first-level and second-level subgraphs (boards and 
chips). The exposition of the theory and the examples in this 
article are selected without reference to a specific configura- 
tion of the DSN’s new decoder. 


II. Butterflies and deBruijn Graphs 

All $2^{K-1}$ states in a constraint length K Viterbi decoder are 
labeled with (K - 1)-bit binary strings. An add-compare-select 
circuit takes as inputs the accumulated metrics of two states 
whose labels differ only in the rightmost bit. Each of these 
accumulated metrics has a different branch metric added to it 
and the smaller of the two sums is selected. 

Two add-compare-select circuits take inputs from the same 
pair of states. The output of one of these goes to a state 
obtained by discarding the rightmost bit of the input states 
and prefixing a 0 on the left. The output of the other add- 
compare-select circuit goes to the state defined similarly but 
with a prefixed 1 instead of 0. These two add-compare-select 
circuits group to form a butterfly, depicted in Fig. 1. The 
butterfly has two input wires and two output wires for trans- 
mission of accumulated metrics. The butterfly needs only four 
wires, because its two add-compare-select circuits get their 
inputs from the same pair of states. Also, it can be shown [4] 
that a butterfly’s two add-compare-select circuits can share the 
same hardware for computing branch metrics. These facts 
make butterflies natural elements to work with. 

A butterfly is labeled by dropping the rightmost bit of the 
label of either of its input states. The butterfly connection 
diagram is a deBruijn graph with $2^{K-2}$ nodes. Each node in 
this graph is labeled by a (K - 2)-bit binary string and each 
edge is labeled by a (K - 1)-bit binary string.¹ Each node is 
connected to four other nodes via four directed edges. A node 
receives its inputs via the pair of edges obtained by appending 
a 0 or 1 to the right of the node’s label, and it sends its out- 
puts via the pair of edges obtained by prefixing a 0 or 1 to the 
left of the node’s label. A diagram of the connections for an 
arbitrary butterfly is given in Fig. 2. 
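
The labeling rules above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, labels are held as bit strings, and nothing here is taken from the decoder's actual design files.

```python
from itertools import product


def all_butterflies(K):
    """All (K - 2)-bit butterfly labels for a constraint length K decoder."""
    return ["".join(bits) for bits in product("01", repeat=K - 2)]


def input_edges(label):
    # States (edges) whose accumulated metrics enter this butterfly:
    # append a 0 or 1 to the right of the node label.
    return [label + "0", label + "1"]


def output_edges(label):
    # States (edges) whose accumulated metrics leave this butterfly:
    # prefix a 0 or 1 to the left of the node label.
    return ["0" + label, "1" + label]


def descendants(label):
    # An edge e enters the butterfly labeled e[:-1] (drop the rightmost bit).
    return [edge[:-1] for edge in output_edges(label)]


def parents(label):
    # An edge e leaves the butterfly labeled e[1:] (drop the leftmost bit).
    return [edge[1:] for edge in input_edges(label)]


if __name__ == "__main__":
    K = 7
    nodes = all_butterflies(K)
    wires = {edge for node in nodes for edge in output_edges(node)}
    print(len(nodes), "butterflies,", len(wires), "wires")
    print("10110 sends to", descendants("10110"))
```

Holding labels as strings keeps the shift-and-prefix rules visible; an integer representation with shifts and masks would be the more natural choice in hardware description code.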

III. Wiring Approaches 

The full deBruijn graph of $2^{K-2}$ butterflies requires exactly 
$2^{K-1}$ wires for the exchange of accumulated metrics. This 
total number of connections cannot be increased or reduced 
by any wiring scheme. However, it is advantageous to capture 
as many of these required connections as possible within iden- 
tical, small, modular units (chips and boards). Wires internal to 
modules can be implemented by duplicating the small module’s 
simple wiring diagram, while external wires between modules 
must be implemented wire-by-wire. 

One mathematically appealing way of creating identical 
modular units that incorporate a reasonable proportion of 
internal wires is to exploit one of the Hamiltonian paths [3] 
of the deBruijn graph. One of the two outputs of each butter- 
fly is connected to one of the two inputs of another butterfly 
in a big ring (Fig. 3a). This ring contains all of the butterflies 
and half of their connections. The remaining half of the con- 
nections form an irregular pattern across the interior of the 
ring, as illustrated in Fig. 3(a). Identical modules can be con- 
structed by slicing the Hamiltonian ring into equal-size linear 
segments (Fig. 3b). Almost half of the wires required for 
accumulated metric exchange can be implemented internally 
within the modules. 

A second wiring approach is based on FFT-type connection 
patterns. Modules (chips and boards) are constructed from dis- 
joint subsets of butterflies called roots. Each module contains 
its root butterflies, first-generation descendants of these roots, 
descendants of these descendants, and so forth. The descendants 
of a butterfly are the two butterflies to which it sends its 
outputs. The module contains all descendants at each genera- 
tion except those that are roots of another module. 

If a set of $2^b$ root butterflies is consecutive in the last b bits 
(i.e., the last b bits take on all possible values and all other bits 
are the same), then their descendants through b generations 
are a block of butterflies obtained by cyclic shifting the roots 
by b bits or less. A module containing the roots and all of 
these descendants would have $(b + 1)2^b$ butterflies and the 
same connection pattern as an ordinary signal processing FFT 
of (b + 1) stages, as shown in Fig. 4. This cyclic shifting may, 
however, generate one of the input strings. Unfortunately, it is 
impossible to completely partition any deBruijn graph into 
non-overlapping full-FFT modules. A module’s connection diagram 
must be punctured at those nodes corresponding to root 
nodes of another module. The result is a crenellated-FFT connection 
pattern, a subgraph of a full FFT. 

¹ In the remainder of this article, the terms butterfly, node, butterfly 
label, and node label will be used interchangeably, as will the terms 
state, wire, edge, state label, and edge label. 

If the root butterflies are selected wisely, most of the 
full decoder’s $2^{K-2}$ butterflies are found in some module’s 
crenellated-FFT diagram. However, some butterflies do not 
belong to any crenellated FFT. These butterflies are free in 
the sense that their wiring is not specified by the crenellated-FFT 
construction. The free butterflies must physically reside 
within modules, but their connections to other butterflies 
must be implemented by external wiring (outside the modules), 
or else the modules’ internal wiring would not be identical. 

A module based on the crenellated-FFT construction thus 
contains two types of butterflies. The majority of butterflies 
belong to a crenellated-FFT pattern, and some or all of their 
required connections are implemented by internal wiring 
(within the module) which is identical from module to module. 
The remaining free butterflies typically have no internal con- 
nections, but instead communicate via four external pins (two 
for input and two for output). The pin reductions which free- 
butterfly interconnections make possible are trivial. 

For the DSN’s new Viterbi decoder, the set of root butterflies 
is taken to be the set of all $2^{K-4}$ butterflies having the 
common prefix 10. This selection of root butterflies works 
well (i.e., captures a large fraction of wires within modules) for 
module sizes from $2^4$ to about $2^9$ butterflies.² The full block 
of root butterflies is subdivided into consecutive blocks of 
roots for board modules, which are further subdivided into 
consecutive blocks of roots for the chip modules on each 
board. The crenellated FFTs generated from these root butter- 
flies are hierarchical in the sense that the crenellated FFT for 
the board is constructed without breaking any of the connec- 
tions in the crenellated FFTs for the chips on the board. 

A single shift of a string having 10 as a prefix cannot pro- 
duce another string having 10 as a prefix. Hence, for modules 
constructed from $B_0 = 2^b$ consecutive root butterflies with 
the prefix 10, the number $B_1$ of first-generation descendants in 
the crenellated FFT equals the number of roots $B_0$. The number 
of butterflies $B_g$ in each succeeding generation, g, of the 
crenellated FFT is given by the linear recurrence 

$$B_g = B_{g-1} - \tfrac{1}{4} B_{g-2}$$

for $2 \le g \le b = \log_2 B_0$. The module only contains descendants 
through the bth generation; (b + 1)th-generation descendants 
cannot be included because their parent nodes belong to 
two different modules. It can be shown by evaluating the 
recursion formula that the number of free butterflies is b + 3 
and the total number of butterflies in the module (free butterflies 
plus butterflies in the crenellated FFT) is $2^{b+2}$, or four 
times the number of roots. The number of external wires³ 
leading off the module is $2^{b+2} + 4(b + 3)$, an average of 
$1 + (b + 3)2^{-b}$ external wires per butterfly on the module. 

² 01 would have been an equally good choice, but not 00 or 11. Other 
prefixes (such as 100) or combinations of prefixes (such as 100, 1101) 
work better for larger modules. A full discussion of the efficiencies of 
various root selections is beyond the scope of this article. 

Figure 5 shows the connection diagram for a 32-butterfly 
chip module based on roots with the prefix 10. The crenel- 
lated FFT is on the left and the six free butterflies on the right 
have all their wires leading off chip. The crenellated FFT for 
the chip starts with eight root butterflies and continues for 
three generations of descendants from these roots. The crenellated 
FFT resembles an ordinary 8 × 4-stage FFT, except for 
punctures eliminating six of the nodes. The number of external 
wires per 32-butterfly chip is 56. 

Figure 6 shows the connection pattern for a 512-butterfly 
board module based on roots with the prefix 10. The crenel- 
lated FFT contains 128 roots and 7 generations of descendants. 
The 128 × 8-stage ordinary FFT template is obvious, even 
though over half the nodes from this template are missing in 
the crenellated version. The crenellated-FFT structure includes 
502 of the board module’s 512 butterflies, leaving just 10 free 
butterflies per board. The number of external wires per 512- 
butterfly board is 552, just over 1 wire per butterfly (about 
half as many external wires as for a same-size module based on 
the Hamiltonian path construction). 
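
The counts quoted for the 32-butterfly chip (b = 3) and the 512-butterfly board (b = 7) can be reproduced by brute-force enumeration. The sketch below is illustrative, not the authors' procedure. It assumes that a generation-g node has the form (g leading bits)(10)(remaining bits), and that it stays in the module exactly when that 10 is the first occurrence of 10 in its label, which is how Section IV locates the partial address. The function names are hypothetical, and the external-wire count applies the formula stated in the text (one wire per butterfly plus four per free butterfly).

```python
from itertools import product


def generation_sizes(b):
    """Butterflies per generation g = 0..b in a module with 2**b roots."""
    sizes = []
    for g in range(b + 1):
        # Count the g-bit strings that can sit to the left of the root
        # prefix 10 without creating an earlier occurrence of 10.
        prefixes = sum(
            1
            for bits in product("01", repeat=g)
            if ("".join(bits) + "10").find("10") == g
        )
        # The remaining b - g variable bits of the root block are free.
        sizes.append(prefixes * 2 ** (b - g))
    return sizes


def module_summary(b):
    """Module totals, assuming a module size of 2**(b + 2) as in the text."""
    total = 2 ** (b + 2)
    in_fft = sum(generation_sizes(b))
    free = total - in_fft
    return {
        "butterflies": total,
        "crenellated_fft": in_fft,
        "free": free,
        "external_wires": total + 4 * free,
    }


if __name__ == "__main__":
    for b in (3, 7):
        print(b, generation_sizes(b), module_summary(b))
```

For b = 3 the generation sizes come out as 8, 8, 6, 4, matching the six punctures of the 8 × 4-stage template in Fig. 5.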

Figures 5 and 6 illustrate how the definition of the first- 
level subgraph (a board) is completely consistent with the 
definition of the second-level subgraph (a chip). The 512- 
butterfly board in Fig. 6 is built from sixteen of the 32- 
butterfly chips in Fig. 5. In Fig. 6 arrows correspond to chip 
pins, and unconnected arrows represent board pins (which 
must be connected to pins on other boards via the backplane). 
Heavy lines represent wires on the board between chip pins, 
and thin lines represent internal connections within the chip. 
Pictorially, the crenellated-FFT portions of eight of the sixteen 
chips in Fig. 6 are identical copies of the crenellated-FFT 
portion of the chip in Fig. 5, and the crenellated-FFT portions 
of the other eight chips are depicted by their mirror images 
(for convenience of display). Similarly, the depictions of the 
six free butterflies in each chip are displaced horizontally by 
varying amounts to emphasize the crenellated-FFT structure 
of the board. 

³ External wire and pin counts quoted in this article refer only to the 
wires required for exchange of accumulated metrics and do not include 
additional wires and pins needed for power and so forth. 

The hierarchical nature of the crenellated-FFT construction 
holds not just for 32-butterfly chips and 512-butterfly boards 
but also for all other module sizes $2^{b+2}$. Each module constructed 
from consecutive roots with the prefix 10 can be 
built from two modules half its size constructed from the same 
type of roots. 


IV. Butterfly Addressing 

Each butterfly, described by a (K - 2)-bit binary string, 
must be assigned a (K - 2)-bit address or location. The full 
address specifies the butterfly’s exact position in the modular 
hierarchy. The most significant bits of the address correspond 
to the butterfly’s board and chip location. For example, in 
a $2^4$-board/$2^8$-chip configuration for a constraint length 15 
decoder ($2^{13}$ total butterflies), the four most significant bits 
of the address specify the board, and the next four bits specify 
the chip within a board. The five least significant bits of the 
address specify the position of the butterfly within a chip. 

The addressing formula is somewhat arbitrary, but it must 
satisfy two basic conditions: (1) it must be a one-to-one mapping 
from (K - 2)-bit butterflies to (K - 2)-bit addresses and 
(2) it must be consistent with the partition of the deBruijn 
graph into crenellated FFTs, i.e., all butterflies assigned to 
certain chip and board locations by the crenellated-FFT con- 
struction should be mapped to those same locations by the 
addressing formula. Free butterflies may be mapped to any 
convenient free address. 

The specification of a butterfly’s (K - 2)-bit address pro- 
ceeds as follows. First, compute the butterfly’s partial address 
by dropping from its (K - 2)-bit label all of the most signifi- 
cant bits through and including the first occurrence of the 
string 10. The partial address consists of all the bits to the 
right of the first 10, and it is empty if there is no occurrence 
of 10 in the butterfly’s (K - 2)-bit label or if 10 first occurs 
in the two least significant bits. The partial address is the only 
part of the full address that is specified by the crenellated-FFT 
partition. For example, in a $2^8$-chip decoder, a partial address 
of 8 or more bits will determine exactly which chip a given butterfly 
belongs to, but a butterfly with a partial address of 7 bits or 
less is one of the free butterflies that is not assigned to any 
chip’s crenellated FFT. 
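
A small enumeration makes the chip assignment concrete. The sketch below is illustrative. It assumes that the chip is identified by the leading `chip_bits` bits of the partial address, and that a butterfly whose partial address is shorter than that is free. The names `partial_address`, `chip_of`, and `chip_census` are hypothetical.

```python
from collections import Counter
from itertools import product


def partial_address(label):
    """Bits to the right of the first occurrence of 10 (empty if none)."""
    i = label.find("10")
    return "" if i < 0 else label[i + 2:]


def chip_of(label, chip_bits):
    """Chip identifier, or None for a free butterfly."""
    pa = partial_address(label)
    return pa[:chip_bits] if len(pa) >= chip_bits else None


def chip_census(n_bits, chip_bits):
    """Count how many n_bits-bit butterflies land on each chip."""
    counts = Counter()
    for bits in product("01", repeat=n_bits):
        counts[chip_of("".join(bits), chip_bits)] += 1
    return counts


if __name__ == "__main__":
    # A small case: 8-bit labels (K = 10) split over 2**3 chips.
    for chip, count in sorted(chip_census(8, 3).items(), key=str):
        print(chip, count)
```

In this small case each of the eight chips receives 26 butterflies from the crenellated-FFT construction, and the 48 butterflies mapped to `None` are the free ones (six per 32-butterfly chip).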

The partial address sets the most significant bits of a butter- 
fly’s full address. The remaining part of the address, called 
arbitrary bits , is completely arbitrary in the sense that any 
choice will be consistent with the crenellated-FFT construc- 
tion. However, the arbitrary bits for all butterflies must be 
chosen in a way that assigns each (K - 2)-bit butterfly to a 
unique (K - 2)-bit address. One simple rule for guaranteeing 
a one-to-one mapping is to choose the arbitrary bits as the 
reversal of the most significant bits (through and including the 
first occurrence of 10) that were dropped to extract the partial 
address. Then, 

$$\begin{aligned}
\text{butterfly} &= (\text{prefix},\ \text{partial address}) \\
                 &= (\rho(\text{suffix}),\ \text{partial address}) \\
\text{address}   &= (\text{partial address},\ \text{suffix}) \\
                 &= (\text{partial address},\ \rho(\text{prefix}))
\end{aligned}$$

where suffix are the arbitrary bits and prefix are the most 
significant bits of butterfly up to and including the first occurrence 
of 10. The notations $\rho(\text{prefix})$ and $\rho(\text{suffix})$ denote the 
reversals of the indicated bit strings. For example, butterfly = 
(abcde10, fghijk) gives address = (fghijk, 01edcba), assuming 
that abcde does not contain the string 10. 

This rule produces a one-to-one mapping because it is 
obviously invertible. Given any (K - 2)-bit address, first determine 
the partial address by dropping all of the least significant 
bits through and including the last occurrence of 01. The 
dropped bits are the arbitrary bits. Now compute the unique 
butterfly label corresponding to that address by concatenating 
the reverse of the arbitrary bits with the partial address. 
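
The addressing rule and its inverse can be exercised with a short Python sketch. It is illustrative only. It assumes that when a label contains no 10 the whole label is treated as the dropped prefix, and symmetrically that an address containing no 01 consists entirely of arbitrary bits. The function names and the 4/4/5 field split (taken from the constraint length 15 example above) are hypothetical conveniences.

```python
from itertools import product


def butterfly_to_address(label):
    """Move the prefix (through the first 10) to the right end, reversed."""
    i = label.find("10")
    cut = len(label) if i < 0 else i + 2
    prefix, partial = label[:cut], label[cut:]
    return partial + prefix[::-1]


def address_to_butterfly(address):
    """Inverse map: the arbitrary bits start at the last occurrence of 01."""
    j = address.rfind("01")
    cut = 0 if j < 0 else j
    partial, suffix = address[:cut], address[cut:]
    return suffix[::-1] + partial


def split_address(address, board_bits=4, chip_bits=4):
    """Split an address into (board, chip, position-within-chip) fields."""
    board = address[:board_bits]
    chip = address[board_bits:board_bits + chip_bits]
    return board, chip, address[board_bits + chip_bits:]


if __name__ == "__main__":
    label = "0011110010011"  # prefix 0011110, partial address 010011
    address = butterfly_to_address(label)
    print(address, split_address(address))
    labels = ["".join(b) for b in product("01", repeat=13)]
    print(len({butterfly_to_address(x) for x in labels}) == len(labels))
```

The inverse works because the reversed prefix always has the form 0, then a run of 1s, then a run of 0s, so it contains 01 only at its left end; the last 01 in the address therefore marks where the arbitrary bits begin.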


V. Making Full Decoders from Chips 
and Boards 

The board and chip modules defined by the crenellated-FFT 
construction have the property that full Viterbi decoders of 
all sizes at least equal to the size of the module can be con- 
structed by appropriately connecting identical copies of the 
module, without revising the internal wiring within any 
module. Figure 7 shows a 32-butterfly chip wired as a con- 
straint length 7 decoder, and Fig. 8 shows two 32-butterfly 
chips wired as a constraint length 8 decoder. Arrows correspond 
to chip pins and heavy lines represent external wires 
between chip pins. Thin lines represent internal connections 
within the chip. Note that many of the heavy lines in Fig. 8 
connect butterflies within the same chip, as do all the heavy 
lines in Fig. 7. However, these connections cannot be incorporated 
internally within the chip, because the chip would no 
longer be a universal module, i.e., some larger constraint 
length decoder could not be built from the more tightly 
wired chips. 


VI. Bounds, Improvements, and Further 
Applications 

There exist lower bounds [4] on the number of edges 
crossing cuts which divide the nodes of a deBruijn graph 
into sets of equal or almost equal cardinality. These follow 
from the very small number of short cycles in the graph and 
do not depend on the sets having identical internal connec- 
tions. The present board design is less than a factor of two 
away from these bounds. 

Chip and board modules may include some additional inter- 
nal connections if they are destined only for a particular size 
of decoder (e.g., just the constraint length 15 decoder). Also, 
by restricting the decoder to constraint lengths 15 and larger 
and allowing one of the boards to be different from the others, 
the number of wires between boards can be reduced without 
changing the chips. These facts offer some flexibility if the 
backplane presents unexpected wiring problems. 

There are additional applications for these results unrelated 
to building Viterbi decoders. For example, the modular decom- 
position of the deBruijn graph might be useful for building 
very big spectrum analyzers and multipliers based on the 
Schönhage-Strassen algorithm [5]. 


VII. Summary 

A novel partition of the deBruijn graph inspired by the 
problem of building a large constraint length Viterbi decoder 
has been introduced. The full decoder is built from identical 
subgraphs called boards, which in turn are built from identical 
subgraphs called chips. The system is modular and hierarchical, 
and it implements a large proportion of the required wiring 
internally within modules. This results in a simpler design, 
reduced cost, and improved testability and repairability. A 
constraint length 15 decoder that uses 512 identical VLSI 
chips and 16 identical printed circuit boards based on this 
partitioning is a feasible design for decoding at a speed of 
1 Mbit/sec. 

Fig. 7. A 32-butterfly chip, wired as a K = 7 decoder. 